53 research outputs found
Towards efficient deep neural networks with applications to visual recognition
The thesis focuses on the following two topics: designing energy-efficient neural
networks and hashing approaches to make deep learning more feasible for real applications;
and deep convolutional neural networks for visual recognition. Thesis (Ph.D.) (Research by Publication) -- University of Adelaide, School of Computer Science, 201
Towards Effective Low-bitwidth Convolutional Neural Networks
This paper tackles the problem of training a deep convolutional neural
network with both low-precision weights and low-bitwidth activations.
Optimizing a low-precision network is very challenging since the training
process can easily get trapped in a poor local minimum, which results in
substantial accuracy loss. To mitigate this problem, we propose three
simple-yet-effective approaches to improve the network training. First, we
propose to use a two-stage optimization strategy to progressively find good
local minima. Specifically, we propose to first optimize a net with quantized
weights and then quantized activations. This is in contrast to the traditional
methods which optimize them simultaneously. Second, following a similar spirit
of the first method, we propose another progressive optimization approach which
progressively decreases the bit-width from high-precision to low-precision
during the course of training. Third, we adopt a novel learning scheme to
jointly train a full-precision model alongside the low-precision one. By doing
so, the full-precision model provides hints to guide the low-precision model
training. Extensive experiments on various datasets (i.e., CIFAR-100 and
ImageNet) show the effectiveness of the proposed methods. To highlight, using
our methods to train a 4-bit precision network leads to no performance decrease
in comparison with its full-precision counterpart with standard network
architectures (i.e., AlexNet and ResNet-50). Comment: 11 page
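The two progressive strategies described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the names `quantize_unit`, `bitwidth_schedule` and `two_stage_plan` are hypothetical, the quantizer is a plain uniform quantizer on [0, 1], and the halving schedule is only one possible way to move from high to low precision.

```python
import math


def quantize_unit(x, bits):
    """Uniformly quantize a value in [0, 1] onto 2**bits evenly spaced levels.

    Assumption: a plain uniform quantizer, used only for illustration.
    """
    levels = 2 ** bits - 1
    clipped = min(max(x, 0.0), 1.0)
    # floor(v + 0.5) rounds half up and avoids Python's round-half-to-even.
    return math.floor(clipped * levels + 0.5) / levels


def bitwidth_schedule(start_bits, target_bits):
    """Return the bit-widths visited when decreasing precision progressively.

    Assumption: the bit-width is halved at each stage; the abstract only says
    that it decreases from high precision to low precision during training.
    """
    bits = start_bits
    stages = [bits]
    while bits > target_bits:
        bits = max(bits // 2, target_bits)
        stages.append(bits)
    return stages


def two_stage_plan():
    """Describe the two-stage strategy: quantize weights first, then activations."""
    return [
        {"stage": 1, "quantize_weights": True, "quantize_activations": False},
        {"stage": 2, "quantize_weights": True, "quantize_activations": True},
    ]


if __name__ == "__main__":
    print(bitwidth_schedule(32, 4))
    print(quantize_unit(0.3, 2))
    for stage in two_stage_plan():
        print(stage)
```

In a real training loop each stage would be initialized from the model produced by the previous stage, which is what makes the optimization progressive.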
TasselNet: Counting maize tassels in the wild via local counts regression network
Accurately counting maize tassels is important for monitoring the growth
status of maize plants. This tedious task, however, is still mainly done by
manual efforts. In the context of modern plant phenotyping, automating this
task is required to meet the need of large-scale analysis of genotype and
phenotype. In recent years, computer vision technologies have experienced a
significant breakthrough due to the emergence of large-scale datasets and
increased computational resources. Naturally, image-based approaches have also
received much attention in plant-related studies. Yet most
image-based systems for plant phenotyping are deployed in controlled
laboratory environments. When transferring the application scenario to
unconstrained in-field conditions, intrinsic and extrinsic variations in the
wild pose great challenges for accurate counting of maize tassels, which go
beyond the ability of conventional image processing techniques. This calls for
more robust computer vision approaches to address in-field variations. This
paper studies the in-field counting problem of maize tassels. To our knowledge,
this is the first time that a plant-related counting problem has been considered
using computer vision technologies in an unconstrained field-based environment. Comment: 14 page
Fast Vision Transformers with HiLo Attention
Vision Transformers (ViTs) have triggered the most recent and significant
breakthroughs in computer vision. Their efficient designs are mostly guided by
the indirect metric of computational complexity, i.e., FLOPs, which, however, has
a clear gap with direct metrics such as throughput. Thus, we propose to use
the direct speed evaluation on the target platform as the design principle for
efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT
which performs favourably against the existing state-of-the-art methods across
a spectrum of different model sizes with faster speed. At the core of LITv2 is
a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the
insight that high frequencies in an image capture local fine details and low
frequencies focus on global structures, whereas a multi-head self-attention
layer neglects the characteristic of different frequencies. Therefore, we
propose to disentangle the high/low frequency patterns in an attention layer by
separating the heads into two groups, where one group encodes high frequencies
via self-attention within each local window, and another group encodes low
frequencies by performing global attention between the average-pooled
low-frequency keys and values from each window and each query position in the
input feature map. Benefiting from the efficient design for both groups, we
show that HiLo is superior to the existing attention mechanisms by
comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and
CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and
1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves
as a strong backbone for mainstream vision tasks including image
classification, dense detection and segmentation. Code is available at
https://github.com/ziplab/LITv2. Comment: NeurIPS 2022 camera read
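The head-splitting idea in the abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the LITv2 implementation: the function names `attend` and `hilo_1d` are hypothetical, tokens form a 1D sequence rather than a 2D feature map, there are no learned projections, and the channels are split evenly with one head per group.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def hilo_1d(tokens, window):
    """Toy HiLo-style attention over a 1D token sequence.

    Assumptions: len(tokens) is a multiple of `window`, the channel count is
    even, and queries, keys and values are the raw features (no projections).
    """
    n = len(tokens)
    dim = len(tokens[0])
    assert n % window == 0 and dim % 2 == 0
    half = dim // 2
    hi_feats = [t[:half] for t in tokens]  # channels for the high-frequency group
    lo_feats = [t[half:] for t in tokens]  # channels for the low-frequency group

    # Low-frequency keys/values: one average-pooled token per window.
    pooled = [
        [sum(col) / window for col in zip(*lo_feats[i:i + window])]
        for i in range(0, n, window)
    ]

    out = []
    for i in range(n):
        start = (i // window) * window
        local = hi_feats[start:start + window]
        hi = attend(hi_feats[i], local, local)    # attention inside the local window
        lo = attend(lo_feats[i], pooled, pooled)  # global attention over pooled tokens
        out.append(hi + lo)                       # concatenate the two groups
    return out


if __name__ == "__main__":
    feats = [[1.0, 0.0, 2.0, 4.0], [0.0, 1.0, 6.0, 8.0]]
    print(hilo_1d(feats, window=2))
```

In this sketch each query in the low-frequency group compares against `n / window` pooled tokens instead of `n`, and each query in the high-frequency group compares against only `window` tokens, which is where the savings come from.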
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
Large Language Models (LLMs), primarily trained on text-based datasets,
exhibit exceptional proficiencies in understanding and executing complex
linguistic instructions via text outputs. However, they falter when asked to
generate non-text outputs. Concurrently, modality conversion models, such as
text-to-image, despite generating high-quality images, suffer from a lack of
extensive textual pretraining. As a result, these models are only capable of
accommodating specific image descriptions rather than comprehending more
complex instructions. To bridge this gap, we propose a novel approach,
SwitchGPT, from a modality conversion perspective that evolves a text-based
LLM into a multi-modal one. We specifically employ a minimal dataset to
instruct LLMs to recognize the intended output modality as directed by the
instructions. Consequently, the adapted LLM can effectively summon various
off-the-shelf modality conversion models from the model zoos to generate
non-text responses. This circumvents the necessity for complicated pretraining
that typically requires immense quantities of paired multi-modal data, while
simultaneously inheriting the extensive knowledge of LLMs and the ability of
high-quality generative models. To evaluate and compare the adapted multi-modal
LLM with its traditional counterparts, we have constructed a multi-modal
instruction benchmark that solicits diverse modality outputs. The experiment
results reveal that, with minimal training, LLMs can be conveniently adapted to
comprehend requests for non-text responses, thus achieving higher flexibility
in multi-modal scenarios. Code and data will be made available at
https://github.com/xinke-wang/SwitchGPT
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Recognising objects according to a pre-defined fixed set of class labels has
been well studied in computer vision. However, there are a great many practical
applications where the subjects that may be of interest are not known
beforehand, or are not so easily delineated. In many of these cases natural
language dialog is a natural way to specify the subject of interest, and the
task of achieving this capability (a.k.a. Referring Expression Comprehension) has
recently attracted attention. To this end we propose a unified framework, the
ParalleL AttentioN (PLAN) network, to discover the object in an image that is
being referred to in variable-length natural language expressions, from
short phrase queries to long multi-round dialogs. The PLAN network has two
attention mechanisms that relate parts of the expressions to both the global
visual content and also directly to object candidates. Furthermore, the
attention mechanisms are recurrent, making the referring process visualizable
and explainable. The attended information from these dual sources is combined
to reason about the referred object. These two attention mechanisms can be
trained in parallel and we find the combined system outperforms the
state-of-the-art on several benchmark datasets with language input of different lengths,
such as RefCOCO, RefCOCO+ and GuessWhat?!. Comment: 11 page